1 INTRODUCTION



Suppose we wish to choose the 'best' model from a set of theoretical
models (or theories) of a natural phenomenon with the aid of the relevant
empirical data. How can we objectively accomplish this goal? Many
statistical techniques such as hypothesis testing (Fisher 1925), Bayesian
evidence (Jeffreys 1961), Akaike information criterion (AIC) (Akaike 1974),
Bayesian Information Criterion (BIC) (Schwarz 1978), minimum description
length (MDL) (Barron, Rissanen & Yu 1998), etc. have been developed to
address the question of model selection.

The late time acceleration of expansion of the universe has been
firmly established



ABSTRACT 

The Akaike Information Criterion (AIC) has been used as a statistical criterion to 
compare the appropriateness of different dark energy candidate models underlying 
a particular dataset. Under suitable conditions, AIC is an indirect estimate of the 
Kullback-Leibler divergence $D(T\|A)$ of a candidate model A with respect to the truth
T. Thus, a dark energy model with a smaller AIC is ranked as a better model, since it 
has a smaller Kullback-Leibler discrepancy with T. In this paper, we explore the impact 
of statistical errors in estimating AIC during model comparison. Using a parametric 
bootstrap technique, we study the distribution of AIC differences between a set of 
candidate models due to different realizations of noise in the data and show that the 
shape and spread of this distribution can be quite varied. We also study the rate of 
success of the AIC procedure for different values of a threshold parameter popularly 
used in the literature. For plausible choices of true dark energy models, our studies 
suggest that investigating such distributions of AIC differences in addition to the 
threshold is useful in correctly interpreting comparisons of dark energy models using 
the AIC technique. 

[...] 2010; Blake et al. 2011), but there is no consensus on the physics
behind this phenomenon. A number of possible explanations such as a small
positive cosmological constant or vacuum energy, otherwise unobserved
dynamical fields usually called dark energy (Ratra & Peebles 1988;
Wetterich 1988; Frieman et al. 1995; Zlatev, Wang & Steinhardt 1999), or a
modification of General Relativity (Dvali, Gabadadze & Porrati 2000;
Deffayet 2001; Carroll et al. 2005) have been proposed. With many models
still consistent with current data, it is clear that further progress of
the field requires the collection of larger and complementary data sets
and a definite framework for model selection. Several large new surveys
such as DES^1, BigBOSS^2, LSST^3 and EUCLID^4 have been planned to study
this late time acceleration by collecting more data (Abbott et al. 2005;
LSST Science Collaborations & LSST Project 2009; Schlegel et al. 2009). Of
course, even with accumulation of more quality data, the importance of
analysing the model selection process will not diminish, because reliable
discriminating methods can always allow us to exploit the available data
maximally.

Hence, statistical techniques for model selection have been applied to
this context (Liddle 2004; Liddle et al. 2006; Szydlowski & Godlowski 2006;
Szydlowski, Kurek & Krawiec [...] 2007; [...] the use of information
criteria like AIC and BIC, which are easy to calculate. A number of works
have applied this to data. In particular, Szydlowski, Kurek & Krawiec
(2006) as well as Biesiada (2007) used the method of AIC and BIC, and a
compilation of SNIa, to compare various late time acceleration
cosmological models.



1 http://www.darkenergysurvey.org/
2 http://bigboss.lbl.gov/
4 http://sci.esa.int/euclid

Davis et al. (2007) and Sollerman et al. (2009) compared such models based
on the Sloan Digital Sky Survey (Kessler et al. 2009) and Equation of
State: Supernovae Trace Cosmic Expansion (ESSENCE) supernova data and
high-redshift data (Riess et al. 2004; Wood-Vasey et al. 2007), along with
a summary of cosmic microwave background (CMB) and Baryon Acoustic
Oscillations (BAO) data, using AIC and BIC. AIC has also been used in
comparing principal component-based models for dark energy (Zhao et al.
2007). We wish to explore the use of AIC in the context of selecting the
'best' theoretical model describing the late time acceleration of the
universe.

This paper will focus on the use and validity of applying the AIC
technique to choose the 'best' late time acceleration model from SNIa
data. For our calculation, we use data from the Constitution compilation
(Hicken et al. 2009b) of the Center for Astrophysics (CFA3) sample (Hicken
et al. 2009a), ESSENCE (Miknaitis et al. 2007), Supernova Legacy Survey
(SNLS) (Astier et al. 2006) and 'High-z' samples (Riess et al. 2007). We
will study and refine the AIC methodology in this context. In particular,
we wish to study its reliability when subjected to estimator uncertainty.
With the AIC technique, the usual approach has been to use a single number
(called AIC) to rank models; the smaller the AIC the better the model.
This concept of discrete ranking of models has been extended (Akaike 1983;
Burnham & Anderson 2004) by paying attention to the differences between
actual AIC values. However, it is important to note that the AIC values
themselves are empirically estimated from the data and thus have
statistical uncertainties. Therefore, the reliability of the estimates of
AIC differences is a crucial issue. While there is work (Shimodaira 1997)
that tries to address the reliability of the AIC technique through
analytic calculations, we will instead study the reliability through
numerical simulations.

The paper is organised as follows. In Section 2 we briefly review the
information theoretic origin of AIC, and extend this idea further by
reviewing the theoretical implications of the AIC differences in Section
3. In Section 4 we study the estimator uncertainty of these AIC
differences in the context of cosmological model selection. This involves
a comparison of a set of four candidate models using SNIa, as in the work
by Szydlowski & Godlowski (2006). The reliability of these AIC estimates
is calculated via a bootstrap method (Efron & Tibshirani 1993), and this
will be used to evaluate the validity of the model selection technique.
The results of these simulations and the conclusion will then be given in
Sections 5 and 6 respectively.



2 THE AKAIKE INFORMATION CRITERION 
(AIC) 

The Kullback-Leibler (KL) divergence (Cover & Thomas 1991) is a commonly
used quantity that measures the discrepancy of one probability
distribution with respect to another probability distribution. We denote
the KL divergence of a candidate model A (which gives the probability
distribution $p_A$) with respect to the truth T (with probability
distribution $p_T$) as $D(T\|A)$, where we have defined T as an underlying
process consisting of a signal with stochastic noise; a particular
empirical datum is a single realisation of this process. If we denote X as
the set of all possible outcomes that can be generated by either A or T,
and x as an element in this set, we can define $D(T\|A)$ as

$$D(T\|A) = \int_{x \in X} dx\, p_T(x) \log\frac{p_T(x)}{p_A(x)}.$$

5 Note that a KL divergence is a non-symmetric measure that does not obey
the triangle inequality.
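The definition above can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes two illustrative Gaussian densities (a hypothetical "truth" T = N(0, 1) and a hypothetical candidate A = N(1, 2^2)) and approximates the integral by a Riemann sum on a uniform grid. The helper names `log_gauss` and `kl_divergence` are introduced here only for illustration.

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log of a normal density with mean mu and standard deviation sigma."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

def kl_divergence(log_p_t, log_p_a, x):
    """Riemann-sum estimate of D(T||A) = int dx p_T(x) log[p_T(x)/p_A(x)]
    for log-densities tabulated on the uniform grid x."""
    dx = x[1] - x[0]
    return float(np.sum(np.exp(log_p_t) * (log_p_t - log_p_a)) * dx)

# Hypothetical example (assumption): truth T = N(0, 1), candidate A = N(1, 2^2).
x = np.linspace(-30.0, 30.0, 200001)
log_t = log_gauss(x, 0.0, 1.0)
log_a = log_gauss(x, 1.0, 2.0)

d_t_a = kl_divergence(log_t, log_a, x)  # D(T||A)
d_a_t = kl_divergence(log_a, log_t, x)  # D(A||T), in general a different number
```

Swapping the two arguments gives a different value, which illustrates the non-symmetry mentioned in footnote 5.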

In this paper, we define a model class as the totality of model
probability distributions with the same parametric form (but with
different parameter values). Within each model class, there is a set of
parameters (the 'best' model) that gives the lowest KL divergence with
respect to the truth. Thus, to choose the 'best' model class we must first
choose the 'best' model (parameter set) from a particular model class as
the representative of the model class. This is done by the maximum
likelihood criterion. The model class selection strategy is thus obtained
by comparing the KL divergence of the representative models of the
individual model classes. However, the truth is unknown a priori, so for a
given representative model A, $D(T\|A)$ cannot be evaluated directly. We
can solve this problem by computing the AIC value (Akaike 1974), which is
an asymptotically unbiased estimator for $D(T\|A)$ (up to a fixed offset
that is independent of the representative models). Since the fixed offset
is independent of the models, and hence the choice of model classes, a
comparison of the AIC values is a useful surrogate for the strategy of
comparing the associated KL divergence of the different candidate model
classes. We can compute the AIC of the representative models of various
model classes, which in the asymptotic limit is known (Akaike 1974) to be

$$\mathrm{AIC} = 2k - 2\log(L_{\mathrm{ML}}), \qquad (1)$$

where $L_{\mathrm{ML}}$ denotes the likelihood evaluated at the set of
model parameters that maximize the likelihood L, and k is the number of
free parameters in the candidate model class. When we make a further
assumption that the distribution of errors follows a Gaussian
distribution, this further reduces to

$$\mathrm{AIC} = 2k + \chi^2_{\mathrm{ML}}, \qquad (2)$$

where $\chi^2_{\mathrm{ML}}$ is the usual chi-square evaluated at the
maximum likelihood estimate of the model parameters. The AIC values are
subsequently ranked by the smallness of their values; the model class with
the smallest value is determined to be the 'best' model class.
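As a small illustration of Eqn. 2 and the ranking rule, the sketch below computes AIC from a chi-square and a parameter count. The chi-square values are not quoted in the text; they are an assumption obtained by back-computing AIC - 2k from the AIC values listed in Table 1, and the function name `aic_gaussian` is hypothetical.

```python
def aic_gaussian(chi2_ml, k):
    """Eqn. 2: AIC = 2k + chi-square at the maximum likelihood point."""
    return 2 * k + chi2_ml

# (k, chi2_ML) per model class. The chi-square values are back-computed
# as AIC - 2k from Table 1 (assumption, not quoted in the paper).
models = {
    "LCDM": (1, 399.35),
    "wCDM": (2, 399.05),
    "CPL": (3, 398.66),
    "DGP": (1, 399.13),
}

aic_values = {name: aic_gaussian(chi2, k) for name, (k, chi2) in models.items()}
ranking = sorted(aic_values, key=aic_values.get)  # smallest AIC first
```

With these inputs the extra parameters of wCDM and CPL lower the chi-square by less than the 2k penalty they incur, so the one-parameter models rank first.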

It is worth noting that the AIC estimate (Eqn. 2) of the KL divergence
assumes that the number of data points is sufficiently large. AIC in the
form written above is an unbiased estimator for large data sets. For
smaller data sets, the 2k term can be corrected by an additional
$2k(k+1)/(N-k-1)$ term to approximately correct for the bias due to
finiteness of the dataset, where N is the number of data points in a
single data set. While further studies to obtain a more accurate
expression for this term are possible, for the cases we shall consider,
this correction is always less than 0.06 (which will be seen to be
negligible for our purposes) and will only decrease in importance when
more data is collected. We shall therefore ignore this correction
altogether in this paper.

As already mentioned explicitly, the AIC procedure for comparing different
model classes actually compares the $\chi^2$ for the best fit
representative model from each model class. These representative models
are derived from the maximum likelihood estimate of the model parameters
for each model class. Thus, in the rest of this paper, when we refer to
comparing model classes, the calculation only involves the $\chi^2$ of the
(best) representative model taken from its respective model class. We
sometimes refer to this as comparing models.



3 MODEL COMPARISON AND AIC 
DIFFERENCES 

Since AIC is essentially the measure of discrepancy of a 
model from the truth, it is intuitively obvious that the 
smaller the AIC difference between two models, the harder it 
becomes to judge which model is better; even if the AIC esti- 
mate of this difference in the KL divergence can be obtained 
without an estimator error, the small difference would make 
it difficult to tell the two probabilistic models apart for a 
small number of observations. Hence, there is a need to as- 
sociate a confidence level for distinguishing between a model 
A and another model B using the AIC difference between 
them. 

Let P(M) be the probability that model M is true. We wish to relate
P(A)/P(B) to the AIC difference

$$\Delta_{A,B} = \mathrm{AIC}(A) - \mathrm{AIC}(B) \approx 2\left[D(T\|A) - D(T\|B)\right],$$

where $\approx$ denotes the asymptotic relation. This asymptotic relation
may not necessarily be realised when the number of samples is small. Using
a certain set of extra assumptions, Akaike (1983) showed that for a pair
of models A and B,

$$P(A)/P(B) \approx \exp[-\Delta_{A,B}/2]. \qquad (3)$$



Since the AIC differences are estimates of the differences of KL
divergences, one can also use the idea of distinguishing probability
distributions to justify Eqn. 3 when the number of samples is sufficiently
large (Appendix A). Assuming the truth to be in the set of candidate
models, Burnham & Anderson (2004) extended Eqn. 3 to obtain the
probability $w_i$ of model i by appropriately normalising the equation:

$$w_i = \frac{\exp[-\Delta_{i,b}/2]}{\sum_{j=1}^{n} \exp[-\Delta_{j,b}/2]}, \qquad (4)$$

where the candidate models are numbered as $i = 1, \cdots, n$ and b
denotes the best model, which has the smallest AIC value among all the
candidate models. Either way, Eqn. 3 quantifies the intuitive idea that it
is easier to select a model over another if the AIC difference is large.
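Eqn. 4 is straightforward to evaluate. The sketch below applies it to the four AIC values listed in Table 1; the function name `akaike_weights` and the dictionary layout are illustrative choices, not taken from the paper.

```python
import math

def akaike_weights(aic_values):
    """Eqn. 4: normalised exp(-Delta_{i,b}/2), where b is the smallest-AIC model."""
    best = min(aic_values.values())
    unnorm = {m: math.exp(-(a - best) / 2.0) for m, a in aic_values.items()}
    total = sum(unnorm.values())
    return {m: u / total for m, u in unnorm.items()}

# AIC values as listed in Table 1 (Constitution compilation).
table1_aic = {"LCDM": 401.35, "wCDM": 403.05, "CPL": 404.66, "DGP": 401.13}
weights = akaike_weights(table1_aic)
```

The ratio of any two weights reproduces Eqn. 3 for that pair, since the normalisation cancels.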

One can use this idea to modify the AIC methodology described above and
suppress the probability of obtaining incorrect results by introducing a
threshold $\Delta_{\mathrm{threshold}}$. Then, rather than ranking all
models according to the smallness of their AIC values, one adopts the
procedure where a model A is ranked to be better than model B if
$\Delta_{A,B} < -\Delta_{\mathrm{threshold}}$, while any two models with
an AIC difference smaller than $\Delta_{\mathrm{threshold}}$ are
considered of equal rank. Eqn. 3 shows that choosing a large enough value
of the threshold $\Delta_{\mathrm{threshold}}$ implies a high probability
that the selected model is truly the better one. However, a large value of
$\Delta_{\mathrm{threshold}}$ also increases the number of model pairs
where the AIC differences are in the range
$-\Delta_{\mathrm{threshold}} < \Delta_{A,B} < \Delta_{\mathrm{threshold}}$.
Since this procedure cannot discriminate between such models, we shall
call such a model selection result indeterminate. In our convention, we
also define the converse of the indeterminate case as the determinate case
($|\Delta_{A,B}| > \Delta_{\mathrm{threshold}}$). Of course, for a
pre-determined choice of $\Delta_{\mathrm{threshold}}$ (corresponding to a
predetermined confidence level), a better dataset gives a smaller fraction
of model pairs which have indeterminate results. As a rule of thumb, a
universal value of the threshold $\Delta_{\mathrm{threshold}} = 5$,
without any regard to the properties of the models under comparison, has
been mentioned by Liddle (2007) as the minimum AIC difference between two
models needed to make a 'strong' assertion that one model is better than
the other. Such a definition has been used extensively in the literature.
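The thresholded ranking rule can be written as a three-way decision. This is a minimal sketch under one stated assumption: a difference exactly equal to the threshold is treated as indeterminate, a boundary case the text does not specify. The function name `compare_models` is hypothetical.

```python
def compare_models(aic_a, aic_b, threshold):
    """Return 'A', 'B' or 'indeterminate' for a threshold >= 0.

    A is ranked better if Delta_{A,B} = AIC(A) - AIC(B) < -threshold,
    B is ranked better if Delta_{A,B} > threshold.
    Assumption: |Delta_{A,B}| == threshold counts as indeterminate.
    """
    delta = aic_a - aic_b
    if delta < -threshold:
        return "A"
    if delta > threshold:
        return "B"
    return "indeterminate"

# AIC values from Table 1 with A = DGP and B = CPL, so Delta_{A,B} = -3.53.
result_moderate = compare_models(401.13, 404.66, threshold=2)
result_strong = compare_models(401.13, 404.66, threshold=5)
```

For this pair the same data give a determinate result at a threshold of 2 and an indeterminate one at a threshold of 5, which is the trade-off described above.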



4 IMPACT OF AIC UNCERTAINTIES IN 
FINITE DATA SETS 

In the preceding section, we have discussed the AIC differences
$\Delta_{A,B}$ without any regard for the fact that AIC is a statistical
estimate. The associated uncertainty in the AIC estimate may not be
negligible and may be dependent on the realisation of noise in a
particular data set. Thus, there must be a statistical uncertainty in the
value of $\Delta_{A,B}$^6 even when estimated from a data set of similar
quality^7. The ensemble of such observations defines an empirical
probability distribution of $\Delta_{A,B}$, and the particular value of
$\Delta_{A,B}$ obtained from the current SN data sets is actually a sample
value drawn from this probability distribution.

Ideally, we should be able to study the probability density distribution
of $\Delta_{A,B}$ under repeated observations of results with sample size
N: $P(\Delta_{A,B}|E_N)$, where $E_N$ denotes a collection of observation
data sets which individually consist of N observation points and are drawn
from the underlying truth process. Because producing a large subset of
$E_N$ is impossible, in this paper we instead use a bootstrap approach
(Efron & Tibshirani 1993) to generate 'mock' empirical data sets and
estimate the probability distribution of $\Delta_{A,B}$.

Perhaps the most frequently used method to produce 'mock' empirical data
is the bootstrap method proposed by Efron & Tibshirani (1993). When we
apply this approach to



6 This is due to a statistical uncertainty in the AIC values coming from
the variation of $\chi^2_{\mathrm{ML}}$. However, the uncertainty in the
AIC values of the models can be correlated, and turns out to be larger
than the uncertainty in the AIC differences.

7 In this paper, two Supernovae data sets are said to have the same
quality when they have the same number of data points, the same set of
redshift z values and the same set of error bars (standard deviation) that
corresponds to the set of z values.

Table 1. Different model classes considered in this paper: We show the
evolution of the Hubble function H(z) with redshift z, the Hubble constant
$H_0$ and other free parameters in the models. k is the number of free
parameters. The respective AIC values were obtained from the Constitution
compilation.

Model   H(z)/H_0                                                                                   Free Parameters           k   AIC
ΛCDM    $\sqrt{\Omega_m(1+z)^3 + (1-\Omega_m)}$                                                    $\Omega_m$                1   401.35
wCDM    $\sqrt{\Omega_m(1+z)^3 + (1-\Omega_m)(1+z)^{3(1+w_0)}}$                                    $\Omega_m, w_0$           2   403.05
CPL     $\sqrt{\Omega_m(1+z)^3 + (1-\Omega_m)(1+z)^{3(w_0+w_a+1)}\exp[-3 w_a z/(1+z)]}$            $\Omega_m, w_0, w_a$      3   404.66
DGP     $\sqrt{\Omega_m(1+z)^3 + \Omega_{rc}} + \sqrt{\Omega_{rc}}$, $\Omega_{rc} = (1-\Omega_m)^2/4$   $\Omega_m$           1   401.13



regression models, we need a probability model that specifies the
distribution of residuals (e.g. Gaussian distribution). In our case in
particular, the distance modulus $\mu$ is related to the redshift z, so a
regression model has the following structure: $\mu = f(z) + \epsilon$,
where $\epsilon$ is the residual (the error term). Since we do not know
the true relation f, we cannot obtain the purely empirical distribution of
the residuals $\epsilon$. Thus, for regression models, the standard
bootstrap method always involves using a model-dependent probability
distribution of residuals. Davis et al. (2007) extended this method to
compare between two regression models and find the standard deviation of
their BIC differences; we will further extend this idea and show that
studying the structure of the distribution of AIC differences is
important.

When we wish to check the reliability of a particular model, we could use
the probability distribution of residuals based on the model itself. Since
we cannot have any model-free bootstrap data, we must choose a particular
model to produce the residual distribution. Therefore, to estimate
$\Delta_{A,B}$, we need some model C as a reference probability model that
is used to generate the bootstrap data. Let us denote the estimate of
$\Delta_{A,B}$ based on a data set d as $\Delta^d_{A,B}$. We wish to
produce $\{\Delta^d_{A,B} \mid d \in C_N\}$, where $C_N$ denotes a
collection of parametric bootstrap data generated by model C, which
individually consist of N observation points with the same data quality as
our empirical data. Although the needed details are in Appendix B, our
approach may be outlined as follows.

Suppose we have a single data set consisting of N observed results
$\{(z_1, \mu_1, \sigma_1), (z_2, \mu_2, \sigma_2), \ldots, (z_N, \mu_N, \sigma_N)\}$,
where $z_i$ denotes the i-th observed redshift, $\mu_i$ the i-th generated
distance modulus, and $\sigma_i$ the observed error bars of $\mu_i$^8. We
can create a bootstrap sample $C_N$ consisting of N observation points
based on model C. $C_N$ relates the same set of coordinates by the
relation $\mu_i = f(z_i) + \epsilon_i$, $i = 1, \cdots, N$, where
$\epsilon_i$ is a stochastic term obeying a normal distribution with zero
mean and standard deviation $\sigma_i$, $N(0, \sigma_i^2)$, and f(z) is
the maximum likelihood estimate of model C (ΛCDM, DGP, etc.) (see the
appendices).

8 Note that these errors, which include light curve fitting errors,
intrinsic dispersion and peculiar velocity corrections, are also required
for calculating quantities like $\chi^2$ for most model selection schemes.
Thus, an underlying assumption of the application of the AIC technique as
in previous works is that these error estimates are correct. Since our
focus is on the statistical uncertainties in AIC after following other
underlying assumptions used in the literature, we also assume that these
error estimates are correct.

As mentioned above, we wish to simulate the distribution of
$\{\Delta^d_{A,B} \mid d \in C_N\}$ as a proxy for $P(\Delta_{A,B}|E_N)$.
To proceed, we choose a subset of models that have been often studied in
the literature (Szydlowski & Godlowski 2006) and are listed in Table 1. We
also assumed that the universe is spatially flat and set the curvature
term $\Omega_k$ in the Hubble function H(z) to zero. Three of these
models, ΛCDM, wCDM and CPL (Chevallier & Polarski 2001; Linder 2003), are
dark energy models with different parametrisations of the equation of
state $w(z) = p(z)/\rho(z)$, where p(z) and $\rho(z)$ are the pressure and
density of dark energy, respectively. These models are nested: setting
$w_a = 0$ in the CPL model, we obtain the wCDM model; setting $w_0 = -1$
in the latter gives the ΛCDM model. We also use the flat DGP model, which
is a modified gravity model and cannot be nested in the previous classes
of models.
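The nesting relations can be made concrete by coding the dimensionless Hubble functions H(z)/H_0. The sketch below uses the standard flat-universe expressions for these four model classes; treating them as the exact forms used in Table 1 is an assumption, as are the function names and the value of Omega_m, which is illustrative and not a fit result.

```python
import numpy as np

def e_lcdm(z, om):
    """H(z)/H_0 for flat LCDM."""
    return np.sqrt(om * (1 + z) ** 3 + (1 - om))

def e_wcdm(z, om, w0):
    """H(z)/H_0 for flat wCDM with constant equation of state w0."""
    return np.sqrt(om * (1 + z) ** 3 + (1 - om) * (1 + z) ** (3 * (1 + w0)))

def e_cpl(z, om, w0, wa):
    """H(z)/H_0 for flat CPL, w(z) = w0 + wa * z / (1 + z)."""
    dark = (1 + z) ** (3 * (1 + w0 + wa)) * np.exp(-3 * wa * z / (1 + z))
    return np.sqrt(om * (1 + z) ** 3 + (1 - om) * dark)

def e_dgp(z, om):
    """H(z)/H_0 for flat DGP, with Omega_rc fixed by flatness."""
    orc = (1 - om) ** 2 / 4
    return np.sqrt(om * (1 + z) ** 3 + orc) + np.sqrt(orc)

z = np.linspace(0.0, 1.5, 50)
om = 0.3  # illustrative value only (assumption), not taken from the paper

cpl_reduces_to_wcdm = np.allclose(e_cpl(z, om, -0.9, 0.0), e_wcdm(z, om, -0.9))
wcdm_reduces_to_lcdm = np.allclose(e_wcdm(z, om, -1.0), e_lcdm(z, om))
```

No parameter choice in the functions above turns `e_dgp` into one of the other three, which is the sense in which DGP is not nested in them.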

To make contact with observational data, we choose candidate models with
parameter sets that are 'best' for the Constitution compilation of SNIa
data (f(z) being the maximum likelihood estimation derived model), where
we use the distance moduli and the error bars in the data (Appendix C).
Since we need to find estimates of cosmological parameters for different
models by maximising likelihoods, we use the results from the more
appropriate MLCS light curve fitter (for $R_V = 1.7$) in Hicken et al.
(2009b). The details of finding the maximum likelihood and the
corresponding AIC value are given in Appendix C. 371 SNIa events were
used. The Hubble function, as a function of cosmological parameters in
each of these models, along with the free parameters and AIC values
(calculated from the Constitution compilation), is shown in Table 1.

For the nested models, the parameter values that fit the data best turn
out to be close to the ΛCDM model. Consequently, one does not gain much in
terms of a lower $\chi^2_{\mathrm{ML}}$, while the extra free parameters
are penalised to give higher values of AIC for wCDM and CPL. The DGP model
gives the best AIC value, which is only slightly better than the ΛCDM
model. It is known that the simultaneous use of Cosmic Microwave
Background (CMB) data from WMAP (Davis et al. 2007; Sollerman et al. 2009)
and Large Scale Structure (LSS) data disfavours the DGP model compared to
ΛCDM, since the parameter subspaces that provide the best fits for CMB,
LSS and SNIa data do not overlap as much as in the ΛCDM model. We checked
that our results are consistent with this, but will ignore the CMB and LSS
data to focus on methodology.

In order to calculate the distribution of $\Delta^d_{A,B}$, we adopt the
best fit model in a model class C as a reference model to produce 5000
mock data sets of $C_{371}$, which are expected to be similar in quality
to the Constitution compilation of Supernova data. All the model classes
in Table 1 are successively chosen as the reference model C. Following our
definition of 'similar quality', each simulated data set has the same set
of redshift values and error bars as the Constitution compilation, while
their apparent magnitudes are those expected from a noisy realisation of
the reference model. The basic steps are:

(i) We produce a mock data set consisting of 5000 realisations of
$d \in C_{371}$ for a reference model C as outlined above.

(ii) Candidate models A and B are fitted to $d \in C_{371}$ by maximising
the likelihood, and the AIC values of these models A and B are computed
through Eqn. 2.

(iii) Thus, for each element $d \in C_{371}$ we can compute the AIC
difference $\Delta^d_{A,B}$. We study
$\{\Delta^d_{A,B} \mid d \in C_{371}\}$ for the reference model C by
plotting a histogram.

We should note that the probability distribution of $\Delta^d_{A,B}$ is
due to errors introduced by the stochastic noise term, but ignores the
effect of the uncertainties in the cosmological parameters of the
reference model C.
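Steps (i) to (iii) can be sketched in a few lines. The code below is a toy stand-in, not the cosmological analysis: to stay self-contained it replaces the cosmological model classes with nested polynomial regression models (an assumption made purely for illustration), for which the maximum likelihood fit is a weighted least-squares solve. The redshift grid, error bars, reference coefficients and all function names are hypothetical.

```python
import numpy as np

def fit_chi2(x, y, sigma, degree):
    """Weighted least-squares polynomial fit; returns the chi-square at the best fit."""
    design = np.vander(x, degree + 1) / sigma[:, None]
    coeffs, *_ = np.linalg.lstsq(design, y / sigma, rcond=None)
    resid = (y - np.polyval(coeffs, x)) / sigma
    return float(resid @ resid)

def bootstrap_aic_differences(x, sigma, ref_coeffs, deg_a, deg_b, n_boot, rng):
    """Parametric bootstrap of Delta^d_{A,B} = AIC(A) - AIC(B).

    Each mock data set keeps the same x values and error bars and adds
    N(0, sigma_i^2) noise to the reference model, as in step (i).
    """
    f_ref = np.polyval(ref_coeffs, x)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        y = f_ref + rng.normal(0.0, sigma)          # step (i): noisy realisation
        chi2_a = fit_chi2(x, y, sigma, deg_a)       # step (ii): fit A and B
        chi2_b = fit_chi2(x, y, sigma, deg_b)
        aic_a = 2 * (deg_a + 1) + chi2_a            # Eqn. 2 with k = degree + 1
        aic_b = 2 * (deg_b + 1) + chi2_b
        deltas[i] = aic_a - aic_b                   # step (iii)
    return deltas

rng = np.random.default_rng(0)
x = np.linspace(0.01, 1.5, 371)      # hypothetical sampling points
sigma = np.full(x.size, 0.2)         # hypothetical error bars
deltas = bootstrap_aic_differences(
    x, sigma, ref_coeffs=[1.0, 40.0], deg_a=1, deg_b=2, n_boot=2000, rng=rng
)
frac_reference_preferred = float(np.mean(deltas < 0))
```

A histogram of `deltas` is the toy analogue of the distributions studied in the next section. In this linear Gaussian toy case the chi-square gain of the larger model follows a chi-square distribution with one degree of freedom, so the fraction of negative differences is expected to be close to P(chi^2_1 < 2), roughly 0.84; the cosmological models are non-linear, so this number should not be read as a prediction for them.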



5 BOOTSTRAP STUDY RESULTS 

We first consider the issue of statistical self-consistency in the
following sense: when the reference model is C, does the bootstrap AIC
method outlined above choose model C as a better model than the rest? The
distribution of the values of $\Delta^d_{C,A}$ for $d \in C_N$ can tell us
about the statistical self-consistency. We start with the case when
$\Delta_{\mathrm{threshold}} = 0$. If
$P(A/C) = P(\{d \in C_N \mid \Delta^d_{C,A} < 0\})$ is larger than some
predefined proportion, which must not be less than 1/2, we may say model C
is better than A when C is the reference. The P(A/C) results for this case
are summarised in Table 2. The P-values^9 for the table are extremely
small.

From Table 2, we obtained $\Delta^d_{\mathrm{DGP,CPL}} < 0$ 89% of the
time when the DGP model is the reference (i.e., P(CPL/DGP) = 0.892), and
83% of the time when the CPL model is the reference (i.e.,
1 - P(DGP/CPL) = 0.836). This means that, when we use a threshold
$\Delta_{\mathrm{threshold}} = 0$, the DGP model is significantly favoured
over the CPL model even if the CPL model is the reference. This means that
the AIC method is statistically inconsistent when it is applied to the CPL
model, and would be automatically
9 The highest P-value found was $3 \times 10^{-8}$ for
$\Delta^d_{\mathrm{DGP},\Lambda\mathrm{CDM}}$. Most of the P-values found
were at the level of machine precision.

X\Y     DGP     ΛCDM    wCDM    CPL
DGP     -       58%     84%     89%
ΛCDM    59%     -       85%     90%
wCDM    16%     21%     -       90%
CPL     14.0%   16%     17%     -

Table 2. Percentage of cases where the AIC method with a threshold
$\Delta_{\mathrm{threshold}} = 0$ selects the correct model over other
candidate models considered. We defined the difference and percentage
using the following convention. If we define the bootstrap data as being
produced by the reference model X (rows) and we are comparing it against
model Y (columns), the difference is defined as the AIC value of X minus
the AIC value of Y and is denoted by the symbol $\Delta^d_{X,Y}$. The
table counts the percentage of negative $\Delta^d_{X,Y}$, $d \in C_N$.
Note that a value greater than 50% indicates that the correct model is
chosen a majority of the time. We use this cut-off of 50% or more as a
simple definition for statistical self-consistency.



disqualified under the current number of data points in the sample. In
another example, we compared the DGP and ΛCDM models in the table and
noticed $\Delta^d_{\Lambda\mathrm{CDM,DGP}} < 0$ 59% of the time when the
ΛCDM model is the reference (i.e., P(DGP/ΛCDM) = 0.594), and 42% of the
time when the DGP model is the reference (i.e., 1 - P(ΛCDM/DGP) = 0.422).
This means that if the reference model was either ΛCDM or DGP and we apply
AIC to compare between them using a zero $\Delta_{\mathrm{threshold}}$,
AIC will only slightly favour the reference. The AIC technique is only
statistically self-consistent (for a zero $\Delta_{\mathrm{threshold}}$)
when applied to compare DGP and ΛCDM, while we cannot use AIC to
self-consistently study the other models under the current level of
observation quality. However, a test that gives the right answer 3 out of
5 times is unreliable, since we can only do a single empirical test from
our actual data. ΛCDM and DGP cannot be distinguished significantly using
either reference model. Thus, looking at both examples, we must conclude
that there are insufficient data points to tell the models apart reliably
using AIC when $\Delta_{\mathrm{threshold}} = 0$. Another trend that
results from the insufficiency of data points is that the AIC procedure
tends to favour models with a smaller number of free parameters. The trend
persists even when we later increase $\Delta_{\mathrm{threshold}}$. This
seems to indicate that the addition of extra free parameters does not
significantly improve the $\chi^2$ fit for the number of data points used.

We next study the behaviour when the threshold parameter is increased to
$\Delta_{\mathrm{threshold}} = 2$ and 5, corresponding to the values used
in the literature for moderate and strong evidence. Unlike the previous
$\Delta_{\mathrm{threshold}} = 0$ case, we have to consider the effect of
indeterminate cases, which we defined in Section 3 as the case when
$|\Delta_{A,B}| < \Delta_{\mathrm{threshold}}$.

In order to study the reliability of the AIC technique at different values
of the threshold parameter $\Delta_{\mathrm{threshold}} > 0$, we analyse
the probability of the selected model being


6 M. Y. J. Tan and Rahul Biswas 



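incorrect for different values of $\Delta_{\mathrm{threshold}}$. To do so,
we define the following:

$$f_{\mathrm{ind}} = \frac{\text{Number of samples with } |\Delta_{A,B}| < \Delta_{\mathrm{threshold}}}{\text{Number of samples}}, \qquad (5)$$

$$f^{\mathrm{all}}_{\mathrm{false}} = \frac{\text{Number of samples with } \Delta_{A,B} > \Delta_{\mathrm{threshold}}}{\text{Number of samples}}, \qquad (6)$$

$$f^{\mathrm{det}}_{\mathrm{false}} = \frac{\text{Number of samples with } \Delta_{A,B} > \Delta_{\mathrm{threshold}}}{\text{Number of samples} \times (1 - f_{\mathrm{ind}})}. \qquad (7)$$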

$f_{\mathrm{ind}}$ is the fraction of cases where the AIC procedure has an
indeterminate result for a given value of the threshold
$\Delta_{\mathrm{threshold}}$, so a high value of $f_{\mathrm{ind}}$
reflects the inadequacy of the data to discriminate between the pair of
models in question with a certain level of confidence for a relevant
$\Delta_{\mathrm{threshold}}$. Using A as the reference model,
$f^{\mathrm{all}}_{\mathrm{false}}$ is the fraction among all cases where
the AIC procedure results in an incorrect model selection.
$f^{\mathrm{det}}_{\mathrm{false}}$ is the fraction among determinate
cases ($|\Delta_{A,B}| > \Delta_{\mathrm{threshold}}$) where the AIC
procedure results in an incorrect selection, and reflects the ratio of
correct to incorrect model selections. Our results are summarized in
Tables 3 and 2. In each table, the rows correspond to different reference
models X, while the columns list candidate models Y that were compared
with the reference model.
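The three fractions can be computed directly from a set of bootstrap AIC differences. The sketch below follows the definitions in the text, with the sign convention AIC(reference) minus AIC(candidate); the function name and the short input list are made up for illustration and are not bootstrap output from the paper.

```python
import numpy as np

def selection_fractions(deltas, threshold):
    """Return (f_ind, f_all_false, f_det_false) for bootstrap AIC differences.

    deltas holds AIC(reference) - AIC(candidate) for each mock data set, so a
    value above the threshold means the candidate was wrongly selected.
    """
    deltas = np.asarray(deltas, dtype=float)
    n = deltas.size
    f_ind = np.count_nonzero(np.abs(deltas) < threshold) / n
    f_all_false = np.count_nonzero(deltas > threshold) / n
    # Undefined when every case is indeterminate.
    f_det_false = f_all_false / (1.0 - f_ind) if f_ind < 1.0 else float("nan")
    return f_ind, f_all_false, f_det_false

# Made-up differences, for illustration only.
example = [-8.0, -6.0, -3.0, -1.0, 0.5, 1.0, 3.0, 7.0]
f_ind, f_all_false, f_det_false = selection_fractions(example, threshold=2.0)
at_zero = selection_fractions(example, threshold=0.0)
```

At a zero threshold there are no indeterminate cases, so the two failure fractions coincide; they separate only once the threshold removes part of the sample from the determinate set.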



of free parameters than the reference model, ffal se tends 
to gently increase before steeply decreasing at a certain 
value of ^threshold- From Fig. (TJ we can see that this 
sharp decline happens at roughly twice the difference in 
the number of free parameters between the candidate and 
reference models. We also study these quantities for the 
case where the reference model has less parameters than 
a candidate model in Fig. [2] In this case, ff^ se actually 
increases rapidly at the point where Athreshold is equal 
to the difference in the number of free parameters in the 
two models. This implies that if the model underlying the 
empirical data was similar to the reference models studied, 
it is improbable that the data set would provide an AIC 
difference for the considered models greater than a large 
threshold A threshold (eg. 5) as shown by f ind . However, 
if this dataset did yield a AIC difference larger than a 
predetermined Athreshold, it does not necessarily mean that 
the AIC selected model has a high probability of being the 
true underlying model. This is because ffalse for a given 
Athreshold depends on the model pairs being considered, 
suggesting that even having AIC differences larger than a 
specified threshold Athreshold does not guarantee reliability 
of the AIC selection process. 



For each pair of reference model X and candidate 
model Y, we show the fractions {f ind, f false, f false)- First 
we reconsider the Athreshold = case in Table. [2] By 
definition, the proportion of Athreshold = cases where the 
AIC method is not statistically self-consistent (1 — P(A/B)) 
is equivalent to both f false and f false- We note that the 
values of f false are large, indicating a high failure rate and 
an unsatisfactory procedure. 

We expect these failures to be suppressed when we 
choose higher values of the threshold Athreshold- When we 
increase the threshold Athreshold to 2 and then 5, we notice 
the expected suppression of f false- However, ffalse does not 
decrease as dramatically as ffai se - 

We study the behaviour of $f^{\rm tot}_{\rm false}$, $f^{\rm det}_{\rm false}$
and $f_{\rm ind}$ in Fig. 1 for different values of $\Delta_{\rm threshold}$
as well as different choices of candidate and reference models, using
bootstrap simulations. The $f^{\rm det}_{\rm false}$ values become dominated
by noise as $f_{\rm ind}$ increases, since the calculation is made from an
ever decreasing number of determinate bootstrap cases.
Hence, the values of $f^{\rm det}_{\rm false}$ at large values of $f_{\rm ind}$
should be ignored. Nevertheless, we can still study the regime of
smaller $f_{\rm ind}$. It should also be noted that for the case of
$\Delta_{\rm threshold} = 5$, the proportion of indeterminate results was
high (Table 4). For example, when CPL is the reference
model, the proportions of $\Delta_{{\rm CPL},\Lambda{\rm CDM}}$ and
$\Delta_{{\rm CPL},{\rm DGP}}$ lying between $\pm 5$ are both approximately 98%.
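The three fractions can be computed directly from a list of bootstrap AIC differences. The following is a minimal sketch, not the code used for the tables: the function name `failure_fractions`, the toy numbers, and the convention that a difference exactly equal to the threshold counts as indeterminate are all assumptions made for illustration. Each entry of `deltas` is taken to be AIC(reference X) minus AIC(candidate Y), so a value above the threshold means the candidate was wrongly preferred.

```python
def failure_fractions(deltas, threshold):
    """Return (f_ind, f_tot_false, f_det_false) for a list of AIC differences.

    deltas: AIC(reference X) - AIC(candidate Y), one per bootstrap sample.
    """
    total = len(deltas)
    # Indeterminate: the two AIC values differ by no more than the threshold.
    indeterminate = sum(1 for d in deltas if abs(d) <= threshold)
    # Wrong selection: the candidate beats the reference by more than the threshold.
    wrong = sum(1 for d in deltas if d > threshold)
    determinate = total - indeterminate
    f_ind = indeterminate / total
    f_tot_false = wrong / total
    # Undefined (the "-" entries of the tables) when no case is determinate.
    f_det_false = wrong / determinate if determinate else None
    return f_ind, f_tot_false, f_det_false


# Hypothetical AIC differences, chosen only to exercise the function.
toy_deltas = [-6.0, -3.0, -1.0, -0.2, 0.5, 1.5, 2.5, 5.5]
for threshold in (0, 2, 5):
    print(threshold, failure_fractions(toy_deltas, threshold))
```

In this toy sample the fraction of wrong selections among all cases falls as the threshold rises, while the fraction among determinate cases does not, which is the qualitative behaviour discussed in the text.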

We also note, as expected, that $f_{\rm ind}$ increases asymptotically
to one as we increase $\Delta_{\rm threshold}$. Increasing
$\Delta_{\rm threshold}$ monotonically suppresses $f^{\rm tot}_{\rm false}$,
but not necessarily $f^{\rm det}_{\rm false}$. $f^{\rm det}_{\rm false}$
decreases with increasing $\Delta_{\rm threshold}$
for the cases where the DGP model was wrongly picked
over the reference ΛCDM model. However, for the other
cases, when the candidate model has a smaller number



In order to intuitively understand what leads to these
examples, it is instructive to consider the shapes of the
distribution of AIC differences when comparing the ΛCDM
and CPL models or the ΛCDM and DGP models. We note
that a comparison of nested models will always involve a
strongly asymmetric, exponential-like distribution of AIC
differences, while the comparison of non-nested models
could result in almost symmetric distributions. This is
because the $\chi^2_{\rm ML}$ for the general case cannot be larger
than that of the more specific case. For example, ΛCDM is a special
case of the more general wCDM and CPL models. Hence,
when we fit these models against any data or simulated
data (regardless of the reference model used to generate it),
the wCDM and CPL $\chi^2_{\rm ML}$ values cannot be larger than the
ΛCDM $\chi^2_{\rm ML}$ value and are usually smaller. This results in
a sharp edge at the minimum value of $\Delta_{\Lambda{\rm CDM},{\rm CPL}}$ and
an exponential-like distribution of AIC differences between
these models.
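The sharp edge can be made explicit with a short worked inequality. The sketch below assumes the form AIC $= \chi^2_{\rm ML} + 2k$ used in this paper and the sign convention $\Delta_{A,B} = {\rm AIC}_A - {\rm AIC}_B$; the symbol $k_M$ for the number of free parameters of model $M$ is introduced here only for illustration, and the final number uses the standard count of two extra parameters ($w_0$, $w_a$) for CPL relative to ΛCDM.

```latex
\begin{align}
\Delta_{\Lambda{\rm CDM},{\rm CPL}}
  &= {\rm AIC}_{\Lambda{\rm CDM}} - {\rm AIC}_{\rm CPL} \\
  &= \left(\chi^2_{{\rm ML},\Lambda{\rm CDM}} - \chi^2_{{\rm ML},{\rm CPL}}\right)
     - 2\left(k_{\rm CPL} - k_{\Lambda{\rm CDM}}\right) \\
  &\geq -2\left(k_{\rm CPL} - k_{\Lambda{\rm CDM}}\right) = -4 ,
\end{align}
```

The inequality holds because the bracketed difference in $\chi^2_{\rm ML}$ is never negative for nested models, so the edge sits where the two fits are equally good.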

We show examples of such a Gaussian-like distribution
in Fig. 3 and an exponential-like distribution in Fig. 4.
As an aside, we note that the different reference models
considered do not make much difference in the shape of
these distributions in the figures. Since AIC $= \chi^2_{\rm ML} + 2k$,
the edge of these exponential-like histograms is shifted by
twice the difference between their number of parameters.
Hence for the comparison between the CPL model and
ΛCDM model in Fig. 1 and Fig. 2, where the difference
between the $\chi^2_{\rm ML}$ values is almost zero at the edge, the
distribution of AIC differences $\Delta_{\Lambda{\rm CDM},{\rm CPL}}$ has a
sharp edge at approximately $-4$. This implies that any AIC difference
larger in magnitude than a $\Delta_{\rm threshold}$ value of approximately 4
will exclusively select the CPL model, irrespective of whether the
reference model used was a CPL model or a ΛCDM model. At lower values
of $\Delta_{\rm threshold}$, the AIC selection procedure tends to select
the model with the lower number of parameters since the
$\chi^2_{\rm ML}$ values are approximately the same. On the other





X\Y     DGP            ΛCDM           wCDM           CPL

DGP     -              (97,1,26)%     (86,5,32)%     (25,4,5)%

ΛCDM    (97,1,24)%     -              (94,4,66)%     (24,4,5)%

wCDM    (87,9,71)%     (90,1,11)%     -              (96,3,79)%

CPL     (31,64,92)%    (32,61,90)%    (94,0,0)%      -



Table 3. Failure rate for $\Delta_{\rm threshold} = 2$: The values in the parentheses show $f_{\rm ind}$, the percentage of total cases where the AIC
procedure has an indeterminate result; $f^{\rm tot}_{\rm false}$, the percentage of total cases where the AIC procedure results in an incorrect model selection;
and $f^{\rm det}_{\rm false}$, the percentage of determinate cases where the AIC wrongly selects a candidate model Y over the reference model X.



X\Y     DGP            ΛCDM           wCDM           CPL

DGP     -              (100,0,"-")%   (99,1,100)%    (99,1,100)%

ΛCDM    (100,0,"-")%   -              (99,1,100)%    (99,1,100)%

wCDM    (100,0,0)%     (99,1,0)%      -              (99,1,0)%

CPL     (99,0,0)%      (98,0,0)%      (99,0,0)%      -



Table 4. Failure rate for $\Delta_{\rm threshold} = 5$: The values in the parentheses show $f_{\rm ind}$, the percentage of total cases where the AIC
procedure has an indeterminate result; $f^{\rm tot}_{\rm false}$, the percentage of total cases where the AIC procedure results in an incorrect model selection;
and $f^{\rm det}_{\rm false}$, the percentage of determinate cases where the AIC wrongly selects a candidate model Y over the reference model X.



hand, there is no similar constraint relating the $\chi^2_{\rm ML}$ values
of the ΛCDM and DGP models (at the best fit value); consequently
the histogram turns out to be Gaussian-like. They have
the same number of free parameters, so this distribution is
roughly centred about zero (the difference in $\chi^2_{\rm ML}$ between
these models). Obviously, if the quality of data was much
better, in terms of smaller error bars on each observation or
a larger number of observations, the differences in
the $\chi^2_{\rm ML}$ terms would be much larger for the same choice of
reference models used. In such cases, the model comparison
would be purely data-driven, deriving its discriminatory
power from the 'fit' term of AIC. This agrees with our
intuitive idea that a better dataset should be able to
resolve models better.

The shapes of the distribution of AIC differences are also
important when we study the statistical uncertainty of the
AIC differences, since the spread of a distribution cannot be
specified unless we know its shape beforehand. Due to the
structural differences between the two kinds of distributions, we
must define the 'error' bars according to the shape in order
to make any useful comments about the uncertainty of the
estimate. For the case of the Gaussian-like distribution,
we define the error bars to be the standard deviation of
the statistical distribution of differences. For the case of
the exponential-like distribution, we define an error bar
region as the range that begins at the sharp edge of the
distribution and stops at the point where the range contains
68.3% of the differences. We chose 68.3% because 0.683 is
approximately the probability of finding an outcome within
a standard deviation of the mean of a Gaussian distribution.
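Both definitions are simple to evaluate on a bootstrap sample. The sketch below is illustrative only: the function names are hypothetical, the samples are synthetic stand-ins rather than the cosmological bootstrap output, and it assumes that the sharp edge lies at the low end of the distribution and can be approximated by the smallest sampled value.

```python
import math
import random


def symmetric_error_bar(deltas):
    """Sample standard deviation, used for Gaussian-like distributions."""
    mean = sum(deltas) / len(deltas)
    variance = sum((d - mean) ** 2 for d in deltas) / (len(deltas) - 1)
    return math.sqrt(variance)


def one_sided_region(deltas, coverage=0.683):
    """Range starting at the low edge that contains `coverage` of the sample."""
    ordered = sorted(deltas)
    last = math.ceil(coverage * len(ordered)) - 1
    return ordered[0], ordered[last]


rng = random.Random(1)
# Synthetic exponential-like sample: an edge at -4 plus a one-sided tail (mean 2).
toy_nested = [-4.0 + rng.expovariate(0.5) for _ in range(20000)]
# Synthetic Gaussian-like sample centred on zero.
toy_symmetric = [rng.gauss(0.0, 0.9) for _ in range(20000)]

edge, upper = one_sided_region(toy_nested)
sigma = symmetric_error_bar(toy_symmetric)
print(edge, upper, sigma)
```

Reporting the region as a pair of end points, rather than a single width, keeps the asymmetry of the exponential-like case visible.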

As an illustration, we look at the statistical uncertainty
of the distribution of $\Delta_{\Lambda{\rm CDM},{\rm DGP}}$ for the two separate
cases where the DGP and ΛCDM models are the references.
The distributions of the differences in both cases are
Gaussian-like. We notice that the standard deviation of
$\Delta_{\Lambda{\rm CDM},{\rm DGP}}$ is 0.89 with the DGP model as the reference and
0.83 with the ΛCDM model as the reference. The AIC difference
between the DGP model and ΛCDM model, $\Delta_{\Lambda{\rm CDM},{\rm DGP}}$,
in our original AIC analysis is 0.22, which is smaller than the
error bars. This result means that any subsequent analysis
based on the observed value of $\Delta_{\Lambda{\rm CDM},{\rm DGP}}$ is unreliable.
The error bars could also be significant even if $\Delta_{\Lambda{\rm CDM},{\rm DGP}}$
was larger than the error bars, as we would have to modify
any subsequent AIC difference analysis to include this
uncertainty. When we look at $\Delta_{\Lambda{\rm CDM},{\rm CPL}}$ (exponential-like
distribution), the error bar region ranges from $-4.00$ to
$-2.23$ for the case when the reference is the ΛCDM model
and from $-4.00$ to $-1.42$ when the reference is the
CPL model. The $\Delta_{\Lambda{\rm CDM},{\rm CPL}}$ value calculated from the
empirical data was found to be $-3.31$. However, an error
bar region with a width of approximately 2 makes the value
of $-3.31$ less certain. Instead of quantifying the odds ratio
given by Eqn. 3 as having a value of 0.19, we now make a
statement about its uncertainty by saying that the odds
ratio can be a value between 0.14 and 0.48. These are
Figure 1. Probabilities of different candidate models (Cand)
being selected over different reference models (Ref) by using AIC
for different values of $\Delta_{\rm threshold}$, for the case of (a) (top) the
candidate model DGP being picked over the reference model ΛCDM,
(b) (middle) the candidate model ΛCDM being picked over the
reference model wCDM, and (c) (bottom) the candidate model
ΛCDM being picked over the reference model CPL. The solid
blue curve shows the number of incorrect results as a fraction
of the cases where the procedure returns a determinate
result. The black dashed curve shows the number of incorrect
results as a fraction of the total number of simulations. The fraction
$f^{\rm det}_{\rm false}$ is extremely noisy and should be ignored when the fraction
of indeterminate cases $f_{\rm ind}$ (red, dotted) is large. The plots show
that while increasing $\Delta_{\rm threshold}$ always decreases
$f^{\rm tot}_{\rm false}$, $f^{\rm det}_{\rm false}$ does
not necessarily decrease. For a comparison of the models used,
this shows that AIC tends to incorrectly select models with a

Figure 2. Probability of the candidate model (Cand) CPL being
selected over the reference model (Ref) ΛCDM for different values
of the threshold. In this case, $f^{\rm det}_{\rm false}$ actually increases sharply
at about twice the difference in free parameters between these
models, showing that in this case AIC tends to incorrectly select
the model with a higher number of parameters.



just two examples in which one can carry out an analysis
to determine the reliability of the model likelihood ratio
P(A)/P(B) that is calculated in Eqn. 3.
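The conversion from an AIC difference to a likelihood ratio is a one-line calculation. The sketch below assumes the relation $\exp[-\Delta_{A,B}/2] = P(A)/P(B)$ from Appendix A, with $\Delta_{A,B} = {\rm AIC}_A - {\rm AIC}_B$; the function name is hypothetical, and reading the quoted odds as those of CPL relative to ΛCDM is an assumption made here because it reproduces values close to the ones quoted in the text.

```python
import math


def relative_likelihood(delta_ab):
    """P(A)/P(B) = exp(-Delta_{A,B} / 2), with Delta_{A,B} = AIC_A - AIC_B."""
    return math.exp(-delta_ab / 2.0)


# Values quoted in the text for Delta_{LCDM,CPL}.
observed = -3.31
region = (-4.00, -1.42)  # error bar region with CPL as the reference

# Odds of CPL relative to LCDM: the inverse of P(LCDM)/P(CPL).
odds_observed = 1.0 / relative_likelihood(observed)
odds_range = tuple(1.0 / relative_likelihood(d) for d in region)
print(round(odds_observed, 2), [round(o, 2) for o in odds_range])
```

Propagating the end points of the error bar region through the same function is what turns a single odds ratio into a range.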

We note that the statistical uncertainty of the AIC differences
obtained above is smaller than the $\Delta_{\rm threshold}$ value
of 5 already mentioned above (Liddle 2007; Davis et al.
2007; Sollerman et al. 2009). However, it is still significant
enough and needs to be considered in our analysis.



6 CONCLUSION 

AIC has been widely used as a technique for model selection.
Most commonly, this has been applied by computing
the AIC values for each candidate model through Eqn. 5
and selecting the model with the smallest AIC value as
the best model. The issue of considering the magnitudes
of AIC differences between the models to indicate the
relative plausibility or confidence in the models has also
been addressed by Akaike and elaborated by Burnham and
Anderson through the use of Akaike weights (Eqn. 3). In the
field of cosmology, AIC has been used in selecting models
underlying the late time acceleration of the universe using
data. There have also been suggestions of a rule of thumb
which states that an AIC difference of 5 or more would give strong
evidence for the model with the smallest AIC value. This
approximately corresponds to a ratio of model likelihoods
of 12 or more, since $\exp(5/2) \approx 12$.

In this paper, we propose a method for calculating
the ratio of the model likelihood between models A and
B based on their AIC differences $\Delta_{A,B}$ (Appendix A),
using the idea of probabilistic model distinguishability by
Balasubramanian (1997). This is an alternative method
of arriving at the odds ratio of Eqn. 3, assuming that the
Figure 3. Probability distributions of the AIC differences
between ΛCDM and DGP ($\Delta^{C}_{\Lambda{\rm CDM},{\rm DGP}}$) for different reference
models C. C is used to generate the respective bootstrap samples
and is written in the right upper corner of the figures. The
horizontal axis indicates the $\Delta^{C}_{\Lambda{\rm CDM},{\rm DGP}}$ value while the vertical
axis indicates their relative frequency. If the process underlying
our observations was really the best fit model of class C, then the
values of the AIC differences under different realisations of noise
would have the histogram distribution $\{\Delta^{C}_{\Lambda{\rm CDM},{\rm DGP}} \mid d \in C_{371}\}$
shown. Vertical lines show the respective AIC differences that
were derived from the observed data.



Figure 4. Probability distributions of the AIC differences
between ΛCDM and CPL ($\Delta^{C}_{\Lambda{\rm CDM},{\rm CPL}}$) for different reference
models C. C is used to generate the respective bootstrap samples
and is written in the right upper corner of the figures. The
horizontal axis indicates the $\Delta^{C}_{\Lambda{\rm CDM},{\rm CPL}}$ value while the vertical
axis indicates their relative frequency. If the process underlying
our observations was really the best fit model of class C, then the
values of the AIC differences under different realisations of noise
would have the histogram distribution $\{\Delta^{C}_{\Lambda{\rm CDM},{\rm CPL}} \mid d \in C_{371}\}$
shown. The exponential-like distributions observed are due to the
fact that ΛCDM and CPL are nested models. Vertical lines show
the respective AIC differences that were derived from the
observed data.
AIC difference between the candidate models is a perfect
unbiased estimator of the difference between the models'
KL divergences with respect to the truth. A related analysis
using the AIC differences in accordance with the Akaike
weights also derives the same equation. The analysis of
$\Delta_{A,B}$ was extended further by investigating the statistical
uncertainty of this estimate. Our focus was not necessarily
on the 'best' cosmological theory. Thus, we did not use
the most exhaustive data sets, nor explore in detail the
systematics associated with the surveys considered.

To this end, we studied the distribution of the differences
of AIC estimates given a certain quality of data,
$P(\Delta_{A,B} \mid E_N)$. Since we do not know the exact process
underlying the empirical data ($E_N$ in Section 4), we approach
this problem by studying surrogate processes, where the
reference or generating model is assumed to be one of four
(best fit) candidate models for the late time acceleration of
the universe, following the approach used by Davis et al.
(2007). These models were listed in Table 1 with the best
fit free parameters equal to the maximum likelihood values
of each of these models. This was obtained by fitting 371
SNIa extracted from the Constitution compilation.

Our simulations have demonstrated that the data used
were insufficient to reliably use
AIC to tell all the models apart, in agreement with the
general consensus (Davis et al. 2007). For the case of
$\Delta_{\rm threshold} = 0$, the failure rate of the technique was shown
to be particularly unsatisfactory. We also studied the
reliability of the AIC technique when $\Delta_{\rm threshold}$ is increased
to 2 and 5. Increasing $\Delta_{\rm threshold}$ results in increasing the
number of cases where we cannot make a conclusion based
on the AIC procedure. This was demonstrated by $f_{\rm ind}$,
which measures the fraction of cases when the difference in
AIC values between two models is less than $\Delta_{\rm threshold}$. We
also studied $f^{\rm det}_{\rm false}$, the proportion of cases where the AIC
procedure using a threshold $\Delta_{\rm threshold}$ gives an incorrect
result as a fraction of cases where we can make a conclusion
(i.e. $|\Delta_{A,B}| > \Delta_{\rm threshold}$). We showed that
$f^{\rm det}_{\rm false}$ does not
necessarily decrease in the same universal way with an
increasing $\Delta_{\rm threshold}$. Therefore, even when AIC chooses a
model class (with a high level of $\Delta_{\rm threshold}$), the result is
unreliable for at least some models within that model class.
The demonstrated examples would perhaps not arise if the
data were good enough that the differences in $\chi^2_{\rm ML}$ were large.

We also calculated the respective statistical uncertainty
($\sim 1\sigma$ error bars) of $\Delta_{A,B}$ and showed it was even larger
than the observed $\Delta_{A,B}$ between some of the models. This
gives us a way to gauge the adequacy of the number of
data points, since the statistical uncertainty would become
smaller than the observed differences when there is a
sufficient amount of data. As an important example, we
considered the ΛCDM and DGP models since they were
shown to have the two lowest AIC values in Table 1. It
was shown that the statistical uncertainty of $\Delta^{C}_{\Lambda{\rm CDM},{\rm DGP}}$
was larger than the observed $\Delta_{\Lambda{\rm CDM},{\rm DGP}}$, making it
difficult to determine the better model between the two.
From our simulation, we also showed that the shapes of
the distribution of the AIC differences can be quite varied,
ranging from a symmetric Gaussian-like distribution to an
exponential-like distribution with a sharp edge and
one-sided tail. Thus, in order to use AIC reliably, one must pay
proper attention to the statistical variation of $\Delta_{A,B}$.

In this paper, we made a number of assumptions to 
study the AIC technique. All calculations in this paper 
were only for an assumed reference. Since the empirical 
data does not give us Em, there is no way to know the 
actual distribution of $\Delta_{A,B}$. However, we should note that
AIC is a model comparison technique that assumes that 
one of the model classes contains the reference C. By 
restricting C to the set of candidate models, we can at 
least look for statistical self-consistency in that assump- 
tion. It should be emphasised again that the reference 
models used were the best fit models and did not take 
into account the statistical uncertainty of the individual 
model parameters. That can be taken into account by 
sampling the distribution of parameters. As mentioned 
before, the whole point of this simulation is to highlight 
the statistical distribution of the AIC differences under 
different reference models and look for statistical inconsis- 
tencies under each of the assumptions. Another issue that 
should be noted is that the exact variance in the data is 
unknown and that the error bars in the data may not be 
reflective of the true error bounds. While we use these in 
our simulations, we note that the correctness of these error 
estimates was an assumption of previous AIC computations. 

In summary, the reliability of the AIC estimator is an 
important issue that should be taken into consideration 
when using the AIC technique to select models. It should 
also be noted that such considerations are not just restricted 
to AIC but any technique that relies on the maximum 
likelihood estimators. This should be borne in mind when 
applying the techniques to any statistical analysis. 
APPENDIX A: CONFUSION PROBABILITY
AND MODEL LIKELIHOOD

This is an almost identical repeat of Balasubramanian's
(1997) explanation of error probabilities, which is framed in
the language of hypothesis testing. It is reproduced for the
convenience of the reader. Suppose $\{x_1, x_2, \ldots, x_N\} \in X^N$
are drawn as independent and identically distributed (i.i.d.)
variables from one of $f_1$ and $f_2$ with $D(f_1\|f_2) < \infty$. Let
$A_N \subseteq X^N$ be the acceptance region for the hypothesis that
the distribution is $f_1$ and define the type I and type II
error probabilities as $\alpha_N = f_1^N(A_N^c)$ and $\beta_N = f_2^N(A_N)$
respectively. $A_N^c$ is the complement of $A_N$ in $X^N$ and $f^N$
denotes the product distribution on $X^N$ describing $N$ i.i.d.
outcomes drawn from $f$. In this definition $\alpha_N$ is the
probability that $f_1$ is mistaken for $f_2$, and $\beta_N$ is the probability
of the opposite error. Stein's lemma tells us how low we can
make $\beta_N$ given a particular value of $\alpha_N$. Indeed, let us
define $\beta_N^{\epsilon} = \min_{A_N \subseteq X^N,\ \alpha_N < \epsilon} \beta_N$
for a positive $\epsilon$. Then Stein's lemma tells us

$$ \lim_{\epsilon \to 0} \lim_{N \to \infty} \frac{1}{N} \ln \beta_N^{\epsilon} = -D(f_1\|f_2). \tag{A1} $$

To prove Stein's lemma, we refer to the proof by
Cover & Thomas (1991), which is provided almost verbatim
here for convenience. Defining $\delta \in R^+$, we first state
$A_N$ more explicitly as:

$$ A_N = \left\{ x \in X^N : \exp[N(D(f_1\|f_2) - \delta)] \leq \frac{f_1^N(x)}{f_2^N(x)} \leq \exp[N(D(f_1\|f_2) + \delta)] \right\}. $$

Then, we have the following properties:

1. $f_1^N(A_N) \to 1$.
Proof:

$$ f_1^N(A_N) = f_1^N\left( \frac{1}{N} \sum_{i=1}^{N} \log \frac{f_1(x_i)}{f_2(x_i)} \in \left( D(f_1\|f_2) - \delta,\ D(f_1\|f_2) + \delta \right) \right) \to 1 $$

by the law of large numbers, since $D(f_1\|f_2) = E_{f_1}\left[ \log \frac{f_1(X)}{f_2(X)} \right]$.
Therefore, for any positive $\epsilon$, $\alpha_N < \epsilon$ for large $N$.

2. $f_2^N(A_N) \leq \exp[-N(D(f_1\|f_2) - \delta)]$.
Proof:

$$ f_2^N(A_N) = \sum_{A_N} f_2^N(x) \leq \sum_{A_N} f_1^N(x) \exp[-N(D(f_1\|f_2) - \delta)] $$
$$ = \exp[-N(D(f_1\|f_2) - \delta)] \sum_{A_N} f_1^N(x) = \exp[-N(D(f_1\|f_2) - \delta)] (1 - \alpha_N). $$

3. $f_2^N(A_N) \geq (1 - \alpha_N) \exp[-N(D(f_1\|f_2) + \delta)]$.
Proof:

$$ f_2^N(A_N) = \sum_{A_N} f_2^N(x) \geq \sum_{A_N} f_1^N(x) \exp[-N(D(f_1\|f_2) + \delta)] $$
$$ = \exp[-N(D(f_1\|f_2) + \delta)] \sum_{A_N} f_1^N(x) = \exp[-N(D(f_1\|f_2) + \delta)] (1 - \alpha_N). $$

4. $\lim_{N \to \infty} \frac{1}{N} \log \beta_N = -D(f_1\|f_2)$.
Proof:

From 2. and 3. we know:

$$ \frac{1}{N} \log \beta_N \leq -D(f_1\|f_2) + \delta + \frac{\log(1 - \alpha_N)}{N}, $$
$$ \frac{1}{N} \log \beta_N \geq -D(f_1\|f_2) - \delta + \frac{\log(1 - \alpha_N)}{N}. $$

Since $\alpha_N \to 0$ and $\delta$ can be chosen arbitrarily small, the
result follows.

5. No other sequence of acceptance regions does better.
Proof: Let $B_N \subseteq X^N$ be any other sequence of acceptance regions with
$\alpha_{N,B_N} = f_1^N(B_N^c) < \epsilon$. Let $\beta_{N,B_N} = f_2^N(B_N)$. Then

$$ \beta_{N,B_N} = f_2^N(B_N) \geq f_2^N(A_N \cap B_N) = \sum_{A_N \cap B_N} f_2^N(x) $$
$$ \geq \sum_{A_N \cap B_N} f_1^N(x) \exp[-N(D(f_1\|f_2) + \delta)] = \exp[-N(D(f_1\|f_2) + \delta)] \sum_{A_N \cap B_N} f_1^N(x) $$
$$ \geq (1 - \alpha_N - \alpha_{N,B_N}) \exp[-N(D(f_1\|f_2) + \delta)], $$

where the last inequality is due to the following:

$$ \sum_{A_N \cap B_N} f_1^N(x) = f_1^N(A_N \cap B_N) = 1 - f_1^N(A_N^c \cup B_N^c) $$
$$ \geq 1 - f_1^N(A_N^c) - f_1^N(B_N^c) = 1 - \alpha_N - \alpha_{N,B_N}. $$

Hence, $\frac{1}{N} \log \beta_{N,B_N} \geq -D(f_1\|f_2) - \delta + \frac{1}{N} \log(1 - \alpha_N - \alpha_{N,B_N})$,
and since $\delta > 0$ is arbitrary, $\lim_{N \to \infty} \frac{1}{N} \log \beta_{N,B_N} \geq -D(f_1\|f_2)$. Thus
no sequence of sets $B_N$ has an exponent better than $D(f_1\|f_2)$.

In summary, property 1. shows that $A_N$ is the sequence
that is generated by $f_1$ in the asymptotic limit. Properties
2., 3. and 4. derive the error probability of Stein's lemma
and property 5. shows that $A_N$ is asymptotically optimal
and the best error exponent is $-D(f_1\|f_2)$.

Thus, we can interpret $\exp[-D({\rm truth}\|{\rm model})]$ as the
probability of confusing the model with the truth, or model
probability, using the work of Balasubramanian (1997).

This relation allows us to propose a slightly different
but related way of interpreting the AIC difference between
models as defining the ratio between model probabilities
$P(A)/P(B)$ without resorting to Akaike weights. Let us start
with 2 models A and B with $f_A$ and $f_B$ as their respective
probability distribution functions. Their AIC values are $a$
and $b$ respectively and there is a difference of $\Delta_{A,B} = a - b$
between their AIC values.



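$$ \exp[-\Delta_{A,B}/2] = \frac{\exp[-a/2]}{\exp[-b/2]}
   \approx \frac{\exp\left[ E_{X|\theta_0}\left[ \log f_A(X|\hat{\theta}_A) \right] \right]}{\exp\left[ E_{X|\theta_0}\left[ \log f_B(X|\hat{\theta}_B) \right] \right]}
   = \frac{\exp\left[ -D\left({\rm truth}\|f_A(X|\hat{\theta}_A)\right) \right]}{\exp\left[ -D\left({\rm truth}\|f_B(X|\hat{\theta}_B)\right) \right]}
   = \frac{P(A)}{P(B)}, $$

where $X$ represents data sampled from the truth $\theta_0$. $\hat{\theta}_A$ and
$\hat{\theta}_B$ are the maximum likelihood parameters of model A and
B.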



APPENDIX B: BOOTSTRAP METHOD 

The bootstrap method used in this paper was a parametric
bootstrap, where we made certain assumptions about
the parametric relationship between the data input
(explanatory variable) and the data output (response variable).
We start with a set of $N$ actual data points $D =
\{(x_1, y_1, \sigma_1), (x_2, y_2, \sigma_2), \ldots, (x_N, y_N, \sigma_N)\}$, where $y_i$ is the
response variable that is observed with the error bar $\sigma_i$
and $x_i$ the explanatory (input) variable. To produce a
bootstrap sample set $C_N$ consisting of $N$ data points, as noted
in the text, we use a particular model C: $y = f(x)$ with
the needed parameters chosen by maximum likelihood
estimation. In the usual bootstrap approach, the obtained set
$\{y_i - f(x_i)\}$ is regarded as the estimate of the noise
distribution, but in our case, unfortunately, the noise magnitude
seems to depend on $x$. Therefore, we make an example of
size $N$ bootstrap data as $\{f(x_i) + \epsilon_i\}$, where $x_i$ are the same
as in $D$ and the noise $\epsilon_i$ is a Gaussian random variable
obeying $N(0, \sigma_i^2)$. With newly generated $\{\epsilon_i\}$ we make a set of
bootstrap data samples $C_N$. For each bootstrap sample
$d \in C_N$, we estimate the maximum likelihood parameters
for model A and B, respectively, and we can compute the
set of AIC differences $\{\Delta_{A,B} \mid d \in C_N\}$.
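The procedure above can be sketched on a toy problem. This is not the cosmological pipeline: the two nested models (a constant, and a straight line), the input grid, the error bars and the fixed reference curve are all hypothetical stand-ins chosen so that the fits have closed forms, and the reference is simply taken to be a constant rather than a maximum likelihood fit to real data.

```python
import random


def chi2_constant(xs, ys, sigmas):
    """Minimum chi-square of the one-parameter model y = a."""
    weights = [1.0 / s ** 2 for s in sigmas]
    a = sum(w * y for w, y in zip(weights, ys)) / sum(weights)
    return sum(w * (y - a) ** 2 for w, y in zip(weights, ys))


def chi2_line(xs, ys, sigmas):
    """Minimum chi-square of the two-parameter model y = a + b*x."""
    weights = [1.0 / s ** 2 for s in sigmas]
    sw = sum(weights)
    sx = sum(w * x for w, x in zip(weights, xs))
    sy = sum(w * y for w, y in zip(weights, ys))
    sxx = sum(w * x * x for w, x in zip(weights, xs))
    sxy = sum(w * x * y for w, x, y in zip(weights, xs, ys))
    det = sw * sxx - sx * sx
    b = (sw * sxy - sx * sy) / det
    a = (sxx * sy - sx * sxy) / det
    return sum(w * (y - a - b * x) ** 2 for w, x, y in zip(weights, xs, ys))


def bootstrap_aic_differences(xs, sigmas, reference, n_boot, rng):
    """AIC(constant) - AIC(line) for n_boot parametric bootstrap samples."""
    deltas = []
    for _ in range(n_boot):
        # Same inputs as the data; Gaussian noise with the per-point error bar.
        ys = [reference(x) + rng.gauss(0.0, s) for x, s in zip(xs, sigmas)]
        aic_a = chi2_constant(xs, ys, sigmas) + 2 * 1
        aic_b = chi2_line(xs, ys, sigmas) + 2 * 2
        deltas.append(aic_a - aic_b)
    return deltas


rng = random.Random(0)
xs = [0.05 * i for i in range(1, 41)]
sigmas = [0.1 + 0.2 * x for x in xs]  # noise magnitude depends on x
deltas = bootstrap_aic_differences(xs, sigmas, lambda x: 1.0, 2000, rng)
wrong_fraction = sum(d > 0 for d in deltas) / len(deltas)
print(min(deltas), wrong_fraction)
```

Because the toy models are nested and differ by one parameter, the resulting differences have a sharp edge at $-2$, the one-parameter analogue of the edge discussed for ΛCDM and CPL.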



APPENDIX C: GETTING $\chi^2_{\rm ML}$ FROM SNIA,
MARGINALISING OVER $H_0$

We present the use of AIC in the context of cosmological
model selection using SNIa from the Constitution compilation
(Hicken et al. 2009b). We fit the theoretical quantity
of the distance moduli $\mu(z_i)$ ($z_i$ is the observed redshift)
against its observed value $\mu_i^{\rm obs} = m_i - M_i$, where $m_i$ is
the observed apparent magnitude and $M_i$ is the absolute
magnitude of the supernova. Note that the index $i$
indicates the $i$th data point.

$\mu(z)$ is calculated by the equation $\mu(z) = 5\log_{10}[d_L(z)/10\,{\rm pc}]$,
where $d_L(z)$ is the luminosity distance. We will assume that
the universe is flat, by setting the curvature term $\Omega_k$ in the
Hubble function $H(z)$ to zero, and under this assumption

$$ d_L(z) = (1 + z)\, c \int_0^z \frac{dz'}{H(z')}, \tag{C1} $$

where $c$ is the speed of light.
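A minimal numerical sketch of this calculation is given below. It is an illustration only: it uses flat ΛCDM as the example Hubble function, the parameter values ($H_0 = 70$ km/s/Mpc, $\Omega_m = 0.3$) and all function names are assumptions that do not come from this paper, the integral is done with a simple trapezoidal rule, and the distance modulus uses the standard definition relative to 10 pc.

```python
import math

C_KM_S = 299792.458  # speed of light in km/s


def hubble_flat_lcdm(z, h0=70.0, omega_m=0.3):
    """H(z) in km/s/Mpc for flat LCDM (illustrative parameter values)."""
    return h0 * math.sqrt(omega_m * (1.0 + z) ** 3 + 1.0 - omega_m)


def luminosity_distance_mpc(z, hubble=hubble_flat_lcdm, steps=1000):
    """d_L(z) = (1 + z) c * integral_0^z dz'/H(z'), by the trapezoidal rule."""
    if z == 0.0:
        return 0.0
    dz = z / steps
    integral = 0.0
    for i in range(steps):
        z0 = i * dz
        z1 = z0 + dz
        integral += 0.5 * dz * (1.0 / hubble(z0) + 1.0 / hubble(z1))
    return (1.0 + z) * C_KM_S * integral


def distance_modulus(z, hubble=hubble_flat_lcdm):
    """mu(z) = 5 log10(d_L / 10 pc), valid for z > 0."""
    d_l_pc = luminosity_distance_mpc(z, hubble) * 1.0e6
    return 5.0 * math.log10(d_l_pc / 10.0)


for z in (0.1, 0.5, 1.0):
    print(z, luminosity_distance_mpc(z), distance_modulus(z))
```

Passing the Hubble function as an argument keeps the integration code unchanged when a different flat model is substituted for `hubble_flat_lcdm`.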

We start with the assumption that $\mu^{\rm obs}$ has a Gaussian
noise structure. We model it as

$$ \mathcal{L} \propto \exp\left[ -\sum_i \frac{\left(\mu_i^{\rm obs} - \mu(z_i)\right)^2}{2\sigma_i^2} \right], \tag{C2} $$



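where $\sigma_i$ are consistent with the error bars associated with
$\mu_i^{\rm obs}$ in the Constitution compilation. However, $\mu^{\rm obs}$ was
calculated from the apparent (observed) magnitude $m$ by
assuming a fixed value of the absolute magnitude $M$, which we are
actually unsure about. We get around this problem by
introducing a nuisance parameter $g$ and integrating it over a flat
prior (a Gaussian prior whose standard deviation $\to \infty$).
To do this, we first integrate over a Gaussian distribution
of the nuisance parameter with standard deviation $\sigma_g$
to get

$$ \mathcal{L} \propto \int \exp\left[ -\sum_i \frac{\left(\mu_i^{\rm obs} - \mu(z_i) - g\right)^2}{2\sigma_i^2} \right]
   \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp\left[ -\frac{g^2}{2\sigma_g^2} \right] dg. $$

We can rewrite this in matrix form:

$$ \mathcal{L} \propto \int \exp\left[ -\frac{(X - gY)^T A (X - gY)}{2} \right]
   \frac{1}{\sqrt{2\pi\sigma_g^2}} \exp\left[ -\frac{g^2}{2\sigma_g^2} \right] dg, $$

where $X$ is an $n$-vector whose $i$th component is $\mu_i^{\rm obs} - \mu(z_i)$,
$Y$ is an $n$-vector where all the elements are 1, and $A$ is
the inverse of the covariance matrix (which in this case is
diagonal). The $T$ symbol denotes the transpose of a vector.
Performing the $g$ integral and setting $\sigma_g$ to a large value,
the following marginalised function is obtained:

$$ \mathcal{L} \propto \frac{1}{\sqrt{\sigma_g^2\, Y^T A Y}}
   \exp\left[ -\frac{1}{2} X^T \left( A - \frac{A Y Y^T A}{Y^T A Y} \right) X \right]
   = \exp\left[ -\frac{1}{2} X^T \left( A - \frac{A Y Y^T A}{Y^T A Y} \right) X
   - \frac{1}{2} \log\left( \sigma_g^2\, Y^T A Y \right) \right]. $$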

This reduces the log likelihood to
$-\frac{1}{2} X^T \left( A - \frac{A Y Y^T A}{Y^T A Y} \right) X - \frac{1}{2} \log(\sigma_g^2\, Y^T A Y) - \sum_i \frac{1}{2} \log(2\pi\sigma_i^2)$.
The second term suffers from a log divergence
as $\sigma_g \to \infty$. However, since AIC works by comparing
relative log likelihood values, we can regularise this term
away by setting it to zero. We can also ignore the third term
since it is a fixed constant independent of the parameter
choice.

This reduces finding the maximum likelihood to minimising
$X^T C X$, where $C = A - \frac{A Y Y^T A}{Y^T A Y}$. Because of the
marginalisation against a flat prior, the rank of $C$ is smaller
than the rank of $A$ by one and thus $C$ cannot be inverted.
The marginalisation procedure also implies that the choice
of the Hubble parameter $H_0$ and even the speed of light $c$ is
irrelevant to finding the maximum likelihood values of the
other parameters.
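For a diagonal $A$, the quadratic form $X^T C X$ reduces to a few weighted sums, and its insensitivity to a constant offset in the residuals can be checked numerically. The sketch below is illustrative: the function names and the residual and error-bar values are hypothetical, and $A$ is assumed to be ${\rm diag}(1/\sigma_i^2)$.

```python
def marginalised_chi2(residuals, sigmas):
    """X^T C X with C = A - A Y Y^T A / (Y^T A Y), for diagonal A.

    residuals: components of X (observed minus model distance moduli).
    sigmas: per-point error bars, so that A = diag(1 / sigma_i^2).
    """
    weights = [1.0 / s ** 2 for s in sigmas]
    xax = sum(w * r * r for w, r in zip(weights, residuals))  # X^T A X
    yax = sum(w * r for w, r in zip(weights, residuals))      # Y^T A X
    yay = sum(weights)                                        # Y^T A Y
    return xax - yax * yax / yay


def relative_aic(residuals, sigmas, n_params):
    """Relative AIC: marginalised chi-square plus twice the parameter count."""
    return marginalised_chi2(residuals, sigmas) + 2 * n_params


# Hypothetical residuals and error bars, for illustration only.
toy_residuals = [0.10, -0.20, 0.05, 0.30]
toy_sigmas = [0.10, 0.20, 0.15, 0.25]
shifted = [r + 0.7 for r in toy_residuals]  # same data with a constant offset

print(marginalised_chi2(toy_residuals, toy_sigmas))
print(marginalised_chi2(shifted, toy_sigmas))
```

The two printed values agree up to rounding because $CY = 0$: a constant shift of all distance moduli, such as one produced by changing $H_0$, drops out of the fit.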

This leads to the following relative AIC term:

$$ {\rm AIC} = X^T(\hat{\theta})\, C\, X(\hat{\theta}) + 2k, \tag{C3} $$

where $\hat{\theta}$ is the set of parameters that minimises $X^T C X$.

The first term corresponds to the maximum likelihood
while the second term is the bias correction, which is
dependent on the number of free parameters. The maximum
likelihood parameters can then be found using some
common minimisation procedure. Specifically, these were
found using the Gauss-Newton algorithm (Björck 1996).
This allows us to calculate the AIC values for 4 candidate
models. The data used consists of 371 supernova events
taken from the Constitution compilation (MLCS table)
(Hicken et al. 2009b). We computed the AIC values for the
different models and found that the DGP model has the
smallest AIC value among the four models we considered
in this paper.
As a very small technical side issue, it should be noted
that unlike the other work (Godlowski & Szydlowski 2005)
that used AIC as a model selection tool, as described
above, we marginalised away the $H_0$ term against a flat
prior, which reduces the number of free parameters by one.
This technical difference alone should not affect our use of
AIC.